Analyzing Jokerace Submissions: Vision, Mission, and Strategic Priorities¶
Author: Omniacs.DAO
Case Study: From Raw Text to Actionable Insights with Python and OpenAI¶
How can you make sense of thousands of free-text submissions from the Arbitrum community? This is a common challenge for decentralized autonomous organizations (DAOs), companies, and projects that rely on community feedback to guide their strategy. Raw text is messy, unstructured, and time-consuming to analyze manually.
In this case study, we'll walk through a complete, real-world workflow for analyzing community proposal submissions from the Arbitrum DAO's Jokerace contests. We will use a powerful combination of classic NLP techniques and modern AI to extract meaningful themes and generate a high-level summary.
You will learn how to:
- Preprocess and Clean raw text data for analysis.
- Use N-gram Analysis to find common multi-word phrases.
- Apply Topic Modeling (LDA) to discover latent themes in the submissions.
- Leverage the OpenAI API to synthesize your findings into a concise, human-readable report.
1. Project Overview and Setup¶
Objective
Plurality Labs, founded by @DisruptionJoe and supported by his teammates @prose11 | GovAlpha and @radioflyerMQ, was tasked with setting up experimental governance mechanisms for formalized "sense-making": helping the DAO arrive at a strong mission and vision to guide its decisions before it initiates the capital allocation process, in which grants will be given out to support the ARB ecosystem.
One of those experiments was an incentivized on-chain survey hosted on Jokerace as part of the ThanksARB initiative during Arbitrum's "#GOVMonth". It consisted of four separate on-chain contests where users were prompted to provide feedback on:
- ArbitrumDAO Short-term Strategic Priorities (Reduce Friction) - 0xbf47BDA4b172daf321148197700cBED04dbe0D58
- ArbitrumDAO Long-term Strategic Priorities (Growth and Innovation) - 0x5D4e25fA847430Bf1974637f5bA8CB09D0B94ec7
- ArbitrumDAO Northstar Strategic Framework: Vision - 0x0d4C05e4BaE5eE625aADC35479Cc0b140DDF95D4
- ArbitrumDAO Northstar Strategic Framework: Mission - 0x5a207fA8e1136303Fd5232E200Ca30042c45c3B6
The goal of this case study is to analyze submissions across the four Jokerace contests—Vision, Mission, Short-term Priorities, and Long-term Priorities—to:
- Identify key themes, concerns, and strategic ideas proposed by the community.
- Develop a reusable methodology for future text analysis.
- Create a summarized report of the findings.
Python Environment Setup
First, let's set up our Python environment by loading the necessary libraries. These packages cover everything from data manipulation (pandas) and text mining (nltk, scikit-learn) to API communication (requests).
To install the required packages, run:
pip install pandas scikit-learn nltk beautifulsoup4 requests
# --- Load Libraries ---
# For data manipulation
import pandas as pd
import numpy as np
import re
import os
# For text mining and NLP
from bs4 import BeautifulSoup
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# For interacting with APIs
import requests
import json
# For display purposes in Jupyter
from IPython.display import display, HTML
# --- Initial Setup ---
# Download necessary NLTK data (if not already downloaded)
try:
stopwords.words('english')
except LookupError:
nltk.download('stopwords')
# Initialize stemmer
stemmer = PorterStemmer()
Loading the Data
We'll use one main dataset:
propfinal: Contains the core proposal submissions, including the text content, author, and timestamp from JokerAce.
(Note: The original .RDS file has been converted to a more Python-friendly .csv format for this notebook.)
# Load Data
# The original R script used readRDS. We'll use pandas to read a CSV or Pickle file.
# Ensure the data file 'DataProposalsJokerAce.csv' is in a '../data/' directory relative to this notebook.
try:
propfinal = pd.read_csv('../data/DataProposalsJokerAce.csv')
except FileNotFoundError:
print("Error: Data file not found. Please ensure 'DataProposalsJokerAce.csv' is in the '../data/' directory.")
# Create a dummy dataframe for demonstration purposes if file doesn't exist
propfinal = pd.DataFrame({
'Content': ['<p>Example submission 1 about fixing bugs.</p>', 'Another entry about community growth.', 'Improve the user experience by reducing fees.'],
'Contract': ['0xbf47bda4b172daf321148197700cbed04dbe0d58', '0x5d4e25fa847430bf1974637f5ba8cb09d0b94ec7', '0xbf47bda4b172daf321148197700cbed04dbe0d58'],
'ContentParsed': ['Example submission 1 about fixing bugs.', 'Another entry about community growth.', 'Improve the user experience by reducing fees.']
})
display(propfinal.head())
| BlockTime | Contract | Address | Slug | URL | IsImage | Content | ContentParsed | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-09-06T09:12:57Z | 0xbf47bda4b172daf321148197700cbed04dbe0d58 | 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 | 5877857128534105584654838688249472014792481672... | https://jokerace.xyz/_next/data/k6jlVObdmnNyS2... | False | <p>Arbitrum DAO's short term is to enhance acc... | Arbitrum DAO's short term is to enhance access... |
| 1 | 2023-09-06T09:13:36Z | 0x5d4e25fa847430bf1974637f5ba8cb09d0b94ec7 | 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 | 3271481908788202063139200076303680035696251521... | https://jokerace.xyz/_next/data/k6jlVObdmnNyS2... | False | <p>Arbitrum DAO's long term is to establish a ... | Arbitrum DAO's long term is to establish a rob... |
| 2 | 2023-09-06T09:14:02Z | 0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4 | 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 | 7943498985727536752355302819044307985708513232... | https://jokerace.xyz/_next/data/k6jlVObdmnNyS2... | False | <p>Arbitrum DAO's vision is to revolutionize b... | Arbitrum DAO's vision is to revolutionize bloc... |
| 3 | 2023-09-06T09:14:34Z | 0x5a207fa8e1136303fd5232e200ca30042c45c3b6 | 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 | 2223490552549783591285349162279571232741760277... | https://jokerace.xyz/_next/data/k6jlVObdmnNyS2... | False | <p>Arbitrum DAO's mission is to drive innovati... | Arbitrum DAO's mission is to drive innovation,... |
| 4 | 2023-09-09T11:34:17Z | 0x5a207fa8e1136303fd5232e200ca30042c45c3b6 | 0x88496a9b5d7543148d9f8aecc9f69cb8437a4bae | 5855034030107174221175499644035301590398338140... | https://jokerace.xyz/_next/data/k6jlVObdmnNyS2... | False | <p>cxvxcvfds</p> | cxvxcvfds |
2. Data Cleaning and Preprocessing¶
Text data from the wild is messy. It's filled with HTML tags, punctuation, and inconsistent capitalization. To prepare it for analysis, we need to standardize it through a process called preprocessing. Our pipeline will:
- Remove HTML tags.
- Convert all text to lowercase.
- Remove punctuation.
- Stem words to their root form (e.g., "running" and "runs" both become "run"). This helps group related words, though stemmers miss irregular forms like "ran".
- Trim excess whitespace.
# --- Clean and Preprocess Text ---
def clean_text(text):
# 1. Remove HTML tags
text = BeautifulSoup(text, "html.parser").get_text()
# 2. Convert to lowercase
text = text.lower()
# 3. Remove punctuation
text = re.sub(r'[^\w\s]', '', text)
# 4. Stem words
text = ' '.join([stemmer.stem(word) for word in text.split()])
# 5. Trim whitespace
return text.strip()
# Apply the cleaning function to the 'Content' column
propfinal_clean = propfinal['Content'].apply(clean_text)
# Display a before-and-after comparison
df_preview = pd.DataFrame({
'Raw': propfinal['Content'].head(),
'Cleaned': propfinal_clean.head()
})
display(df_preview)
| Raw | Cleaned | |
|---|---|---|
| 0 | <p>Arbitrum DAO's short term is to enhance acc... | arbitrum dao short term is to enhanc access us... |
| 1 | <p>Arbitrum DAO's long term is to establish a ... | arbitrum dao long term is to establish a robus... |
| 2 | <p>Arbitrum DAO's vision is to revolutionize b... | arbitrum dao vision is to revolution blockchai... |
| 3 | <p>Arbitrum DAO's mission is to drive innovati... | arbitrum dao mission is to drive innov foster ... |
| 4 | <p>cxvxcvfds</p> | cxvxcvfd |
This clean, stemmed text is now ready for more advanced analysis.
3. Exploratory Analysis: N-grams¶
What are the most common phrases in the submissions? While looking at single words (unigrams) is useful, n-grams (sequences of n words) give us more context. For example, the trigram "reduce transaction fees" is far more insightful than the individual words "reduce," "transaction," and "fees."
Let's find the most common trigrams (3-word phrases), excluding common "stop words" like "the," "a," and "is". We'll also add custom stop words specific to our dataset, like "arbitrum" and "dao," to filter out noise.
# --- N-gram Analysis ---
# Define stopwords
custom_stopwords = ['arbitrumdao', 'arbitrum', 'project', 'arb', 'dao']
stop_words = list(stopwords.words('english')) + custom_stopwords
# Use CountVectorizer to get trigram counts
vec = CountVectorizer(ngram_range=(3, 3), stop_words=stop_words).fit(propfinal_clean)
bag_of_words = vec.transform(propfinal_clean)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
trigram_table = sorted(words_freq, key = lambda x: x[1], reverse=True)
# Format as a DataFrame for display
trigram_df = pd.DataFrame(trigram_table, columns=['Trigram', 'Count'])
print("Top 15 Trigrams:")
display(trigram_df.head(15))
Top 15 Trigrams:
| Trigram | Count | |
|---|---|---|
| 0 | make wise decis | 212 |
| 1 | believ make wise | 211 |
| 2 | agre becaus believ | 207 |
| 3 | becaus believ make | 206 |
| 4 | read agre posit | 136 |
| 5 | agre posit thing | 131 |
| 6 | layer scale solut | 122 |
| 7 | creat anti sybil | 100 |
| 8 | shortterm strateg prioriti | 96 |
| 9 | love hope soon | 93 |
| 10 | longterm strateg prioriti | 86 |
| 11 | thi fantast great | 86 |
| 12 | agre make develop | 80 |
| 13 | think make develop | 79 |
| 14 | glad part thi | 79 |
This output immediately gives us a feel for recurring suggestions and ideas within the community.
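The same CountVectorizer pattern extends to any n-gram size by changing ngram_range. A minimal sketch on a toy corpus (an illustration only; swap in propfinal_clean for the real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the cleaned submissions
docs = [
    "reduce transaction fees now",
    "reduce transaction costs",
    "transaction fees hurt users",
]

# ngram_range=(2, 2) counts bigrams instead of trigrams
vec = CountVectorizer(ngram_range=(2, 2)).fit(docs)
counts = vec.transform(docs).sum(axis=0)
freq = sorted(((w, counts[0, i]) for w, i in vec.vocabulary_.items()),
              key=lambda x: -x[1])
print(freq[:3])
```

Here "reduce transaction" and "transaction fees" each appear twice, so they surface at the top of the ranking.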
4. Uncovering Themes with Topic Modeling (LDA)¶
While n-grams show us popular phrases, topic modeling helps us discover the underlying, latent themes across all submissions. We'll use Latent Dirichlet Allocation (LDA), a popular unsupervised algorithm that works by:
- Assuming each document is a mix of topics.
- Assuming each topic is a mix of words.
- Figuring out the "topics" (which are just clusters of words) that best explain the collection of documents.
The LDA Workflow
We'll create a repeatable workflow to process each contest's submissions.
Step 1: Define Helper Functions
We'll create functions to vectorize the text into a Document-Term Matrix (DTM) and run the LDA model. A DTM is a simple matrix where rows represent documents, columns represent terms (words), and the cells contain the word counts.
# --- LDA Helper Functions ---
def create_dtm(texts, stop_words_list, min_freq=5):
"""Creates a Document-Term Matrix from a list of texts."""
vectorizer = CountVectorizer(
stop_words=stop_words_list,
min_df=min_freq, # drop terms appearing in fewer than min_freq documents (the R tm "global bounds" equivalent)
preprocessor=lambda x: re.sub(r'\d+', '', x.lower()), # remove numbers and lowercase
tokenizer=lambda x: [stemmer.stem(w) for w in re.split(r'\W+', x) if w] # stem and tokenize; stems like 'thi' escape the unstemmed stop list, hence sklearn's warning
)
dtm = vectorizer.fit_transform(texts)
# Filter out documents that become empty after preprocessing
non_empty_docs_mask = np.asarray(dtm.sum(axis=1)).flatten() > 0
return dtm[non_empty_docs_mask, :], vectorizer, non_empty_docs_mask
def run_lda(dtm, K, iterations=50):
"""Runs the LDA algorithm."""
lda_model = LatentDirichletAllocation(
n_components=K,
random_state=1234,
max_iter=iterations,
learning_method='online' # scikit-learn uses variational inference; the original R script used Gibbs sampling
)
lda_model.fit(dtm)
return lda_model
def get_top_terms(model, feature_names, n=20):
"""Extracts top terms from the LDA model."""
top_terms = {}
for topic_idx, topic in enumerate(model.components_):
top_terms[f"Topic_{topic_idx + 1}"] = [feature_names[i] for i in topic.argsort()[:-n - 1:-1]]
return pd.DataFrame(top_terms)
Step 2: Set Up and Run the Analysis Loop
Now, we define the contests we want to analyze and the number of topics (K) we want to find for each. Choosing K is a mix of art and science; it often requires experimentation to find a number that produces coherent, distinct topics.
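One common, if imperfect, aid for choosing K is model perplexity: fit LDA for several candidate values and compare. A minimal sketch on toy documents (the real analysis would reuse the DTMs built by the loop; all names here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for one contest's cleaned submissions
docs = [
    "reduce fees improve user experience",
    "lower transaction fees for users",
    "grow the developer ecosystem",
    "support developer tooling and grants",
    "community governance and voting power",
    "improve onboarding for new users",
]
dtm = CountVectorizer().fit_transform(docs)

# Fit LDA for several candidate K values; lower perplexity is generally
# better, but always sanity-check topic coherence by eye as well.
perplexities = {}
for K in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=K, random_state=1234, max_iter=20)
    lda.fit(dtm)
    perplexities[K] = lda.perplexity(dtm)
    print(K, round(perplexities[K], 1))
```

Perplexity alone can keep decreasing as K grows, so treat it as a tie-breaker between values that already produce interpretable topics.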
The loop will perform the following for each contest:
- Subset the proposals for that contest.
- Perform final cleaning and filtering.
- Create the DTM using our helper function.
- Run the LDA model.
- Assign the most likely topic to each proposal.
- Extract and save the top words for each discovered topic.
# --- Define Contests and K (Number of Topics) ---
contest_data = [
{'Address': "0xbf47bda4b172daf321148197700cbed04dbe0d58", 'Name': "Reduce Friction", 'Slug': "RF4", 'Topics': 4},
{'Address': "0x5d4e25fa847430bf1974637f5ba8cb09d0b94ec7", 'Name': "Growth and Innovation", 'Slug': "GI5", 'Topics': 5},
{'Address': "0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4", 'Name': "Vision", 'Slug': "V7", 'Topics': 7},
{'Address': "0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4", 'Name': "Vision", 'Slug': "V10", 'Topics': 10},
{'Address': "0x5a207fa8e1136303fd5232e200ca30042c45c3b6", 'Name': "Mission", 'Slug': "M11", 'Topics': 11},
{'Address': "0x5a207fa8e1136303fd5232e200ca30042c45c3b6", 'Name': "Mission", 'Slug': "M15", 'Topics': 15}
]
contestdf = pd.DataFrame(contest_data)
# Create a directory to store outputs
output_dir = "../project1/temp/txts"
for slug in contestdf['Slug']:
os.makedirs(os.path.join(output_dir, slug), exist_ok=True)
# --- Main LDA Analysis Loop ---
all_results = []
propout = propfinal.copy()
for idx, contest in contestdf.iterrows():
print(f"Processing: {contest['Name']} ({contest['Slug']})")
# 1. Subset data for this contest
subset_df = propfinal[propfinal['Contract'] == contest['Address']].copy()
# 2. Clean and filter text
# Fix encoding, filter by length, and remove duplicates
subset_df['ContentParsed'] = subset_df['ContentParsed'].str.encode('ascii', 'ignore').str.decode('ascii')
subset_df = subset_df[subset_df['ContentParsed'].str.len() > 40]
subset_df = subset_df.drop_duplicates(subset=['ContentParsed'])
textdata = subset_df['ContentParsed']
if len(textdata) < contest['Topics']:
print(f" Skipping {contest['Slug']}: Not enough documents.")
continue
# 3. Create DTM
dtm, vectorizer, valid_mask = create_dtm(textdata, stop_words, min_freq=5)
docdf = subset_df.loc[valid_mask].copy()
if dtm.shape[0] < contest['Topics']:
print(f" Skipping {contest['Slug']}: Not enough documents after pruning.")
continue
# 4. Topic modeling
K = contest['Topics']
lda_model = run_lda(dtm, K)
topic_results = lda_model.transform(dtm)
docdf['Topic'] = topic_results.argmax(axis=1) + 1 # 1-based topic index
# 5. Annotate main dataframe with topic assignment
propout.loc[docdf.index, contest['Slug']] = docdf['Topic']
# 6. Save per-topic submissions as text files
for topic_num in docdf['Topic'].unique():
tempd = docdf[docdf['Topic'] == topic_num]
out_path = os.path.join(output_dir, contest['Slug'], f"Topic-{topic_num}.txt")
with open(out_path, 'w', encoding='utf-8') as f:
for i, row in tempd.iterrows():
f.write(f"Submission : {i}\n{row['ContentParsed']}\n\n\n")
# 7. Save top terms
top_terms_df = get_top_terms(lda_model, vectorizer.get_feature_names_out(), n=50)  # use get_feature_names() on scikit-learn < 1.0
top_terms_df.columns = [f"{contest['Slug']}_{col}" for col in top_terms_df.columns]
all_results.append(top_terms_df)
# Combine all topic term results into a single dataframe
if all_results:
results = pd.concat(all_results, axis=1)
print("\nLDA Analysis Complete. Top terms extracted:")
display(results.head())
else:
print("LDA Analysis did not produce any results.")
Processing: Reduce Friction (RF4)
C:\Users\Business\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', 'onc', 'onli', 'ourselv', 'themselv', 'thi', 'veri', 'wa', 'whi', 'yourselv'] not in stop_words.
warnings.warn('Your stop_words may be inconsistent with '
Processing: Growth and Innovation (GI5)
Processing: Vision (V7)
Processing: Vision (V10)
Processing: Mission (M11)
Processing: Mission (M15)
LDA Analysis Complete. Top terms extracted:
| RF4_Topic_1 | RF4_Topic_2 | RF4_Topic_3 | RF4_Topic_4 | GI5_Topic_1 | GI5_Topic_2 | GI5_Topic_3 | GI5_Topic_4 | GI5_Topic_5 | V7_Topic_1 | ... | M15_Topic_6 | M15_Topic_7 | M15_Topic_8 | M15_Topic_9 | M15_Topic_10 | M15_Topic_11 | M15_Topic_12 | M15_Topic_13 | M15_Topic_14 | M15_Topic_15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | user | thi | thi | blockchain | develop | term | thi | blockchain | thi | thi | ... | blockchain | join | blockchain | thi | grow | commun | commun | mission | valu | mission |
| 1 | develop | commun | commun | secur | user | long | commun | secur | use | commun | ... | enabl | cryptocurr | chain | best | continu | vote | mission | decentr | term | network |
| 2 | network | futur | think | commun | blockchain | commun | great | creat | long | vision | ... | real | exchang | secur | veri | ye | mission | valu | world | long | use |
| 3 | term | term | like | ensur | fund | strateg | team | system | make | futur | ... | token | set | industri | great | put | reward | decis | vision | success | peopl |
| 4 | support | short | good | govern | ecosystem | prioriti | futur | sybil | need | veri | ... | ownership | earn | enhanc | good | choic | propos | ensur | futur | vision | make |
5 rows × 52 columns
By looking at the word clusters for each topic (e.g., RF4_Topic_1), we can assign a human-readable label. For instance, a topic whose top stemmed words include "commun," "blockchain," "develop," "secur," and "support" could plausibly be labeled Governance and Security.
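Once labels are chosen, a quick way to gauge each theme's prevalence is to count topic assignments. A minimal sketch with hypothetical assignments and labels (the real assignments live in the per-contest columns of propout, e.g. propout['RF4'], after the loop above):

```python
import pandas as pd

# Hypothetical topic assignments for eight submissions in one contest
assignments = pd.Series([1, 2, 1, 3, 1, 2, 4, 1])

# Hand-assigned labels, chosen after reading each topic's top words
labels = {1: "User experience", 2: "Governance", 3: "Developer tooling", 4: "Fees"}

counts = assignments.map(labels).value_counts()
print(counts)
```

Ranking themes by submission count gives a rough sense of which concerns dominate the contest before any LLM summarization.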
5. Synthesizing Insights with the OpenAI API¶
We now have structured data: n-grams and topics. But we still need to produce a high-level summary. This is where an LLM like GPT-4o mini shines. We will ask the AI to act as an analyst, providing it with our topic keywords and the raw submissions, and instructing it to generate a synthesized report.
Step 1: Prepare the Prompt
The key to getting a good result from an LLM is a well-structured prompt. Our prompt will:
- Assign a role: "You are an expert blockchain community analyst."
- Provide context: Give it the top keywords from our LDA model.
- Provide the data: Give it the raw text submissions.
- Give clear instructions: Ask it to identify themes and find standout submissions.
- Specify the output format: Request clear headings for easy reading.
# --- OpenAI Summarization ---
# Let's focus on the "Reduce Friction" contest (RF4)
slug_to_summarize = 'RF4'
all_submissions = []
# Extract top words for the chosen contest
try:
rf4_cols = [col for col in results.columns if col.startswith(f"{slug_to_summarize}_Topic_")]
top_words_series = results[rf4_cols].stack().unique()
top_words = ', '.join(top_words_series)
# Load all submission text files for the chosen contest
contest_folder = os.path.join(output_dir, slug_to_summarize)
for filename in os.listdir(contest_folder):
if filename.endswith('.txt'):
with open(os.path.join(contest_folder, filename), 'r', encoding='utf-8') as f:
all_submissions.append(f.read())
submissions_text = '\n'.join(all_submissions)
submissions_text = submissions_text[:8000] # Truncate to avoid exceeding token limits
# Craft the prompt
prompt = f"""
You are an expert blockchain community analyst. Given the following user submissions and most common keywords, create a detailed summary for each of these categories:
1. Reduce Friction (overall)
2. Growth and Innovation
3. Vision
4. Mission
For each category, do the following:
- Identify and describe the main themes, concerns, and opportunities found in the submissions.
- Provide at least 2–3 representative or standout submissions (either as direct quotes or short paraphrases) that best capture the spirit or key insights of that category.
- Where relevant, highlight points of consensus, strong opinions, or recurring challenges and suggestions.
Most common keywords:
{top_words}
User submissions:
{submissions_text}
Format your output with clear headings for each category, and organize within each section as:
- Themes:
[Detailed synthesis]
- Standout submissions:
- [Quote or paraphrase 1]
- [Quote or paraphrase 2]
- [Quote or paraphrase 3]
If a submission is relevant to more than one category, you may include it in multiple sections.
"""
except (NameError, FileNotFoundError) as e:
print(f"Could not prepare prompt. Did the LDA analysis for '{slug_to_summarize}' run successfully?")
prompt = "" # Set prompt to empty to prevent API call
Step 2: Make the API Call
Next, we send our prompt to the OpenAI API using the requests package. For this exercise, you will need an API key from OpenAI.
Important: Never hardcode your API key directly in your script or commit it to version control. Load it from a separate file kept out of the repository, or, more robustly, store it in an environment variable and read it at runtime.
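A minimal sketch of the environment-variable approach (it assumes a variable named OPENAI_API_KEY, set in your shell beforehand, e.g. `export OPENAI_API_KEY="sk-..."`):

```python
import os

# Read the key from the environment; returns None if it is not set,
# so the script can fall back to the key-file approach used below.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    print("OPENAI_API_KEY is not set; falling back to the key file.")
```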
# --- Call OpenAI API ---
api_key = None
try:
# Load your API key securely from a file (e.g., a file named 'openai_key.txt' in a parent directory)
with open('../../openai_key.txt', 'r') as f:
api_key = f.read().strip()
except FileNotFoundError:
print("OpenAI API key file not found. Please create a file named 'openai_key.txt' in the parent directory.")
output_box_style = """
<style>
.output-box {
border: 2px solid #18bc9c;
background: #eafaf1;
padding: 12px;
margin: 1em 0;
border-radius: 6px;
font-size: 1.0em;
font-family: "Fira Mono", "Menlo", "Monaco", "Consolas", monospace;
white-space: pre-wrap;
}
</style>
"""
if api_key and prompt:
headers = {
'Authorization': f'Bearer {api_key}',
'Content-Type': 'application/json'
}
payload = {
'model': 'gpt-4o-mini',
'messages': [{'role': 'user', 'content': prompt}],
'temperature': 0.5,
'max_tokens': 1500
}
try:
response = requests.post('https://api.openai.com/v1/chat/completions', headers=headers, json=payload)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
result = response.json()
content = result['choices'][0]['message']['content']
display(HTML(f"{output_box_style}<div class='output-box'>{content}</div>"))
except requests.exceptions.RequestException as e:
error_content = e.response.text if e.response else str(e)
display(HTML(f"{output_box_style}<div class='output-box'>Error calling OpenAI API:<br>{error_content}</div>"))
else:
display(HTML(f"{output_box_style}<div class='output-box'>Skipping OpenAI API call. API key or prompt not available.</div>"))
6. Conclusion and Recommendations¶
By combining traditional NLP techniques with the power of modern LLMs, we have moved from thousands of raw text submissions to a concise, actionable summary of community feedback.
Key Findings
- Identified Core Themes: Using LDA, we successfully identified and categorized the main topics of conversation, such as governance, onboarding, and developer tooling.
- Pinpointed Specific Suggestions: N-gram analysis helped highlight specific, recurring phrases and proposals.
- Synthesized Actionable Insights: The OpenAI API provided a high-quality narrative summary, saving hours of manual reading and interpretation.
Recommendations for Future Analysis
- Improve Prompting: To get more structured data from the community, design submission forms with clearer, more focused prompts.
- Automate Filtering: Use regular expressions and readability metrics to automatically filter out low-quality or spam submissions.
- Integrate Voting Data: Correlate the topics of proposals with their voting outcomes to see which ideas gained the most traction.
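As a starting point for the "Automate Filtering" recommendation, here is a heuristic sketch; the function name and thresholds are illustrative assumptions, not tuned values:

```python
import re

def looks_like_spam(text, min_len=40, min_vowel_ratio=0.25):
    """Flag very short entries or keyboard-mash strings
    (like the 'cxvxcvfds' submission seen earlier)."""
    stripped = re.sub(r'\s+', ' ', text).strip()
    if len(stripped) < min_len:          # too short to be a real proposal
        return True
    letters = re.findall(r'[a-z]', stripped.lower())
    if not letters:                      # no alphabetic content at all
        return True
    vowels = sum(ch in 'aeiou' for ch in letters)
    return vowels / len(letters) < min_vowel_ratio  # gibberish is vowel-poor

print(looks_like_spam("cxvxcvfds"))  # True: short and vowel-poor
print(looks_like_spam("Improve the user experience by reducing fees and simplifying bridging."))  # False
```

Applied before the LDA step (e.g. `propfinal[~propfinal['ContentParsed'].apply(looks_like_spam)]`), a filter like this would remove much of the noise before topics are fit.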